Automatic Structuring of Written Texts
Abstract
This paper deals with automatic structuring and sentence boundary labelling in natural language texts. We describe the implemented structure-tagging algorithm and the heuristic rules used for automatic or semi-automatic labelling. Within each detected sentence the algorithm performs a decomposition into clauses, and it then marks the parts of the text that do not form a sentence, i.e. headings, signatures, tables and other structured data. We also pay attention to the processing of matched symbols in the text, especially to the analysis of direct speech notation.
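The abstract does not give the rules themselves, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a tiny hypothetical abbreviation list, treats only parentheses, brackets and curly quotes as matched symbols, and labels any line without terminal punctuation as a non-sentence (heading, signature and so on). The names `ABBREVIATIONS`, `split_sentences` and `label_blocks` are invented for illustration, and clause decomposition is not attempted.

```python
import re

# Hypothetical abbreviation list; a real system would need a far larger lexicon.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}

# Matched symbols handled by this sketch (assumption: curly quotes mark direct speech).
PAIRS = {"(": ")", "[": "]", "\u201c": "\u201d"}
CLOSERS = set(PAIRS.values())


def split_sentences(text):
    """Split text at '.', '!' or '?' that lie outside any matched-symbol pair."""
    sentences = []
    stack = []  # closing symbols we are still waiting for
    start = 0
    for i, ch in enumerate(text):
        if ch in PAIRS:
            stack.append(PAIRS[ch])
        elif ch in CLOSERS and stack and stack[-1] == ch:
            stack.pop()
        elif ch in ".!?" and not stack:
            rest = text[i + 1:]
            # A boundary must be followed by whitespace and a capital letter
            # or an opening symbol, unless it ends the text.
            if rest and not re.match(r"\s+[A-Z\u201c(\[]", rest):
                continue
            # Skip periods that belong to a known abbreviation.
            last_token = text[start:i + 1].split()[-1].lower()
            if last_token in ABBREVIATIONS:
                continue
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def label_blocks(text):
    """Label each line as sentence material or as a non-sentence block."""
    labelled = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Lines ending in terminal punctuation (optionally followed by closers)
        # are treated as running text; everything else as a heading-like block.
        if re.search(r"[.!?][\u201d)\]]*$", line):
            labelled.extend(("sentence", s) for s in split_sentences(line))
        else:
            labelled.append(("non-sentence", line))
    return labelled


if __name__ == "__main__":
    sample = (
        "Introduction\n"
        "Dr. Smith arrived (he was late. Very late.) and sat down. "
        "\u201cStop. Wait!\u201d she said.\n"
        "J. Novak"
    )
    for label, chunk in label_blocks(sample):
        print(f"{label}: {chunk}")
```

Tracking open symbols on a stack keeps punctuation inside parentheses or direct speech from ending the enclosing sentence. One known limitation of this sketch is that a sentence ending inside a closing quotation mark is not split from the text that follows it.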
Related resources
Automatic identification of language varieties: The case of Portuguese
Automatic Language Identification of written texts is a well-established area of research in Computational Linguistics. State-of-the-art algorithms often rely on n-gram character models to identify the correct language of texts, with good results seen for European languages. In this paper we propose the use of a character n-gram model and a word n-gram language model for the automatic classifica...
Learner Engagement with Structuring and Problematizing in Scaffolded Writing Tasks: A Mixed-Methods Multiple Case Study
The present study set out to delineate to what extent five intermediate learners engaged in structuring and problematizing scaffolding in two writing tasks. The study aimed at illuminating how the participants engaged with structuring and problematizing scaffolds cognitively, behaviorally, and affectively. Learners’ written essays, think-aloud protocols, and interviews shaped the data sources w...
Tools for Terminology Processing
Automatic terminology processing appeared 10 years ago when electronic corpora became widely available. Such processing may be statistically or linguistically based and produces terminology resources that can be used in a number of applications: indexing, information retrieval, technology watch, etc. We present the tools that have been developed in the IRIN Institute. They all take as input te...
Hwæt! LOL! – common formulaic functions in Beowulf and blogs
We consider the functions that formulae perform in two types of written texts which maintain close links to oral forms: Old English epic poetry and blogs. Five orality-related functions of formulae are identified in both datasets: discourse-structuring, filler, epithetic, gnomic, and tonic. A sixth type of formulaic function, the acronymic, necessarily tied to the written medium, is also found in...
A Database of Freely Written Texts of German School Students for the Purpose of Automatic Spelling Error Classification
The spelling competence of school students is best measured on freely written texts, instead of pre-determined, dictated texts. Since the analysis of the error categories in these kinds of texts is very labor intensive and costly, we are working on an automatic system to perform this task. The modules of the system are derived from techniques from the area of natural language processing, and ...